{[C;<N;αVISION THEORY.;λ30;P60;I425,0;JCFA} SECTION 6.
{JCFD} COMPUTER VISION THEORY.
{λ10;W250;JAFA}
6.0 Introduction to Computer Vision Theory.
6.1 A Geometric Feedback Vision System.
6.2 Vision Tasks.
6.3 Vision System Design Arguments.
6.4 Mobile Robot Vision.
6.5 Related Vision Work.
6.6 Summary.
{λ30;W0;I950,0;JUFA}
[6.0 Introduction to Computer Vision Theory.]
Computer vision concerns programming a computer to do a task
that demands the use of an image forming light sensor such as a
television camera. The theory I intend to elaborate is that general
3-D vision is a continuous process of keeping an internal visual
simulator in sync with perceived images of the external reality so
that vision tasks can be done by reference to the simulator's model
rather than by reference to the original images. The word <theory>,
as used here, means simply a set of statements presenting a
systematic view of a subject. Specifically, I wish to exclude the
connotations that the theory is a mathematical theory or a natural
theory. Perhaps there can be such a thing as an <artificial theory>
which extends from the philosophy thru the design of an artifact. {Q}
[6.1 A Geometric Feedback Vision System.]
Vision systems mediate between images and world models; these
two extremes of a vision system are called, in the jargon, the
<bottom> and the <top> respectively. In what follows, the word
<image> will be used only to refer to the notion of a 2-D data
structure representing a picture. A picture is a rectangle taken
from the pattern of light formed by a thin lens on the nearly flat
photoelectric surface of a television camera's vidicon. A sequence
of images in time will be called a film. On the other hand, a <world
model> is a data structure which is supposed to represent the
physical world for the purposes of a task processor. In particular,
the main point of this thesis concerns isolating a portion of the
world model (called the 3-D geometric world model) and placing it
below most of the other entities that a task processor has to deal
with. The vision hierarchy, so formed, is illustrated in box
(6.1).
{|λ10;JA}
BOX 6.1 {JC} VISION SYSTEM HIERARCHY.
{JC} Task Processor
{JC} |
{JC} Task World Model
The Top → {JC} |
{JC} 3-D Geometric Model
{JC} |
The Bottom → {JC} 2-D Images
{|λ30;JU}
In the gap between the top and the bottom, between
images and the task world model, a
general vision system has three distinguishable modes of
operation: recognition, verification and description. Recognition
vision can be characterized as bottom up into an existing top: what is
in the picture is determined by extracting a set of features from the
image and by classifying them.
Verification vision, also called top down or model driven vision,
involves predicting an image, followed by comparing the predicted
image and a perceived image for differences which are expected but
not yet measured.
Descriptive vision, bottom up or data driven vision, involves
converting the image into a representation that makes it possible (or
easier) to do the desired vision task. I would like to call this kind
of vision "revelation vision" at times, although the phrase
"descriptive vision" is the term used by most members of the computer
vision community.
{|λ10;JU;FA}
Box 6.2 {JC} THREE BASIC MODES OF VISION.
1. Recognition Vision - Feature Classification. (bottom up into an existing top).
2. Verification Vision - Model Driven Vision. (nearly pure top down vision).
3. Descriptive Vision - Data Driven Vision. (nearly pure bottom up vision).
{|λ30;JU}
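For concreteness, the three modes can be sketched in a few lines of
code. The toy program below is written in Python purely for
illustration; every name in it is hypothetical rather than part of any
actual system. An image is taken to be a list of brightness samples
and a model a single brightness value, so that recognition,
verification and description become operations between the two levels.

def synthesize(model):
    # top down: predict an image from a model
    return [model] * 4

def describe(image):
    # descriptive vision, bottom up: reduce an image to a model-level description
    return sum(image) / len(image)

def verify(model, image):
    # verification vision: compare a predicted image with a perceived one
    predicted = synthesize(model)
    return sum(abs(p - q) for p, q in zip(predicted, image))

def recognize(image, prototypes):
    # recognition vision: classify against an existing set of prototypes
    return min(prototypes, key=lambda model: verify(model, image))

perceived = [9, 10, 11, 10]
print(recognize(perceived, prototypes=[0, 10, 20]))    # prints 10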
Now we have enough pieces to outline a system design.
By placing a 3-D geometric model in the gap, recognition vision
could be done on 3-D (rather than 2-D) features into the task world model,
and descriptive vision and verification vision could be used to link
the 2-D and 3-D models in a relatively dumb, mechanical fashion.
Previous attempts to use recognition vision to bridge directly the
large gap between 2-D images (of 3-D objects) and the task world
model have been frustrated, because the characteristic 2-D image
features of a 3-D object are very dependent on the 3-D physical
processes of occultation, rotation and illumination. It is these
processes that will have to be modeled and understood before the
features relevant to the task processor can be deduced from the
perceived images. The arrangement of these elements is diagrammed
below.
{|λ10;JA}
Box 6.3 {JC} A SIMPLE VISION SYSTEM DESIGN.
{JC} Task World Model
{JC} ↑
{JC} RECOGNITION
{JC} ↑
{JC} 3-D geometric model
{JC} ↑ ↓
{JC} DESCRIPTION VERIFICATION
{JC} ↑ ↓
{JC} 2-D images
{|λ30;JU}
I wish to call attention to the lower part of the above diagram;
this portion is the feedback loop of the 3-D geometric vision system.
Depending on circumstances, the vision system should be able to run
almost entirely top-down (verification vision) or bottom-up
(revelation vision). Verification vision is all that is required in a
well known, predictable environment; whereas revelation vision is
required in a brand new (tabula rasa) or rapidly changing
environment. Thus revelation and verification form a loop,
bottom-up and top down. First, there is some kind of revelation that
forms (or selects) a 3-D model; and second, the model is verified by
testing image features predicted from the assumed model. This
loop-like structure has been noted before by others; it is a form of what
Tenenbaum(71) called <accommodation> and a form of what Falk(69)
called <heuristic vision>; however, I will go along with what I think
is the current majority of vision workers, who call it <visual feedback>.
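A minimal sketch of this feedback loop, using the same toy image and
model representation as before and an arbitrary mismatch threshold,
runs top down while the model accounts for the perceived images and
falls back to revelation when it does not; all names here are again
illustrative assumptions, not parts of any implemented system.

def synthesize(model):
    return [model] * 4                    # predict an image from the model

def describe(image):
    return sum(image) / len(image)        # form a model from the image

def mismatch(model, image):
    return sum(abs(p - q) for p, q in zip(synthesize(model), image))

def visual_feedback(model, film, threshold=2.0):
    # keep the internal model in sync with a film of perceived images
    for image in film:
        if model is None or mismatch(model, image) > threshold:
            model = describe(image)       # revelation: (re)form the model
        # otherwise verification alone suffices (predictable environment)
        yield model

film = [[10, 10, 10, 10], [10, 11, 10, 9], [20, 20, 21, 19]]
print(list(visual_feedback(None, film)))  # prints [10.0, 10.0, 20.0]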
Completing the design, the images and worlds are
constructed, manipulated and compared by a variety of processors,
the topmost of which is the task processor. Since the task processor is
expected to vary with the application, it would be expedient if it
could be isolated as a user program calling on utility routines
of an appropriate vision sub-system. Immediately below the task
processor are the 3-D recognition routines and the 3-D modeling
routines. The modeling routines underlie almost everything, because
they are used to create, alter and access the models.
{|;λ10;JAFA}
Box 6.4 {JC} Basic Kinds of Vision Processors in a 3-D Vision System.
{↓} 0. The task processor.
1. 3-D recognition.
2. 3-D modeling routines.
3. Reality simulator.
{↑;w560;} 4. Image analyzer.
5. Image synthesizer.
6. Locus solvers.
7. Comparators: 2D and 3D.
{|;λ30;JUFA}
The remaining processors include the reality simulator, which
does mechanics for modeling motion, collision and gravity.
Also there are image analyzers, which do image enhancement and
conversions such as converting video rasters into line drawings.
There is an image synthesizer, which does hidden line and surface
elimination, for verification by comparing synthetic images from the
model with perceived images of reality. There are three kinds of
locus solvers that compute numerical descriptions for cameras, light
sources and physical objects. Finally, there are of course a large
number (at least ten) of different compare processors for confirming or
denying correspondences among entities in each of the different kinds
of images and 3-D models.
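The division of labor among these processors can be summarized by
sketching their calling interfaces. The signatures below are guesses
at minimal contracts, expressed as Python protocols; they are not the
actual routines of any implemented system, and serve only to make the
data flow of box (6.4) explicit.

from typing import Any, Protocol

class RealitySimulator(Protocol):
    def step(self, world: Any, dt: float) -> Any: ...
    # mechanics: motion, collision and gravity

class ImageAnalyzer(Protocol):
    def analyze(self, video: Any) -> Any: ...
    # enhancement and conversion, e.g. video raster to line drawing

class ImageSynthesizer(Protocol):
    def synthesize(self, world: Any, camera: Any, sun: Any) -> Any: ...
    # hidden line and surface elimination, for prediction

class LocusSolver(Protocol):
    def solve(self, matches: Any) -> Any: ...
    # numerical locus of a camera, a light source, or a body

class Comparator(Protocol):
    def compare(self, a: Any, b: Any) -> float: ...
    # confirm or deny a correspondence, 2-D/2-D or 3-D/3-D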
[6.2 Vision Tasks.]
The 3-D vision research problem being discussed is that of
finding out how to write programs that can see in the real world.
Related vision problems include: modeling human
perception, solving visual puzzles (non-real world), and developing
advanced automation techniques (ad hoc vision). In order to approach
the problem, specific programming tasks are proposed and solutions
are sought; however, please distinguish the idea of a research problem
from that of a programming task. As will be illustrated, many vision
tasks can be done without vision. The vision solution to be found
should be able to deal with real images, should include the
continuity of the visual process in time and space, and should be
general purpose rather than ad hoc. These three requirements
(reality, continuity, generality) will be developed by surveying
six examples of computer vision tasks.
{|;λ10;JAFA}
BOX 6.5{JC} TABLE OF 3-D COMPUTER VISION TASKS.
1. The Robot Chauffeur. Cart Task.
2. The Robot Explorer. Cart Task.
3. The Robot Soldier. Cart Task.
4. Turn Table Task.
5. The Blocks Task.
6. Machine Assembly Tasks.
{|;λ30;JUFA}
First, there is the robot chauffeur task. In 1969, John
McCarthy asked me to consider the vision requirements of a computer
controlled car such as he depicted in an unpublished essay. The idea
is that a user of such an automatic car would request a destination;
the robot would select a route from an internally stored road map;
and it would then proceed to its destination using visual data. The
problem involves representing the road map in the computer and
establishing the correspondence between the map and the appearance of
the road as the automatic chauffeur drives the vehicle along the
selected route. Lacking a computer controlled car, the problem was
abstracted to that of tracing a route along the driveways and parking
lots that surround the Stanford A.I. Laboratory using a television
camera and transmitter mounted on a radio controlled electric cart.
The robot chauffeur task could be solved by non-visual means such as
railroad-like guidance or inertial guidance; to preserve the
vision aspect of the problem, no particular artifacts should be
required along a route (landmarks must be found, not placed), and the
extent of inertial dead reckoning should be noted.
Second, there is the task of a robot explorer. In 1967,
McCarthy and Lederberg published a description of a robot for
exploring the surface of the planet Mars. The robot explorer was
required to run for long periods of time without human intervention,
because the signal transmission time to Mars is as great as twenty
minutes and because the 24.6 hour Martian day would place the vehicle
out of Earth sight for twelve hours at a time. (This latter difficulty
could be avoided at the expense of having a set of communication
relay satellites in orbit around Mars). The task of the explorer
would be to drive around mapping the surface of Mars, looking for
interesting features, and doing various experiments. To be prudent,
a Mars explorer should be able to navigate without vision; this can
be done by driving slowly and by using a tactile collision and
crevasse detector. If the television system fails, the core samples
and so on can still be collected at different Martian sites without
unusual risk to the vehicle due to visual blindness.
The third vision task is that of the robot soldier, tank,
sentry, pilot or policeman. The problem has several forms which are
quite similar to the chauffeur and the explorer with the additional
goal of doing something nasty to an enemy. Although this vision task
has not yet been explicitly attempted at Stanford, to the best of my
knowledge, the reader should be warned that a thorough solution to
any of the other tasks almost assures the Orwellian technology to
solve this one.
Fourth, the turn table task is to construct a 3-D model from
a sequence of 2-D television images taken of an object rotated on a
turn table. The turntable task was selected as a simplification of
the explorer task and is an example of a nearly pure descriptive
vision task.
Fifth, the classic blocks vision task consists of two parts:
first convert a video image into a line drawing; second, make a
selection from a set of predefined prototype models of blocks that
accounts for the line drawing. In my opinion, this vision task
emphasizes three pitfalls: single image vision, line drawings and
blocks. The greatest pitfall, in the usual blocks vision task, is the
presumption that a single image is to be solved, thus diverting
attention away from the most important depth perception mechanism,
which is parallax. The second pitfall is that the usual notion of a
perspective line drawing is not a natural intermediate state, but is
rather a very sophisticated and platonic geometric idea. The perfect
line drawing lacks photometric information; even a line drawing with
perfect shadow lines included will not resemble anything that can
readily be gotten by processing real television pictures. Curiously,
the lack of success in deriving line drawings from real television
images of real blocks has not dampened interest in solving the second
part of the problem. The perfect line drawing puzzle was first
worked on by Guzman and extended to perfect shadows by Waltz;
nevertheless, enough remains so that the puzzle will persist on its
own merits, without being closely relevant to real world computer
vision. Even assuming that imperfect line drawings are given, the
final unreality of the blocks themselves has seduced such
researchers as Falk and Grape to build byzantine systems of
vertex-edge classification which almost certainly can not be extended
beyond the blocks domain. Actually, the blocks would not be such a
bad research simplification if researchers could avoid getting hung
up on the fact that they have edges and vertices, and concentrate
instead on where the blocks are and on how they scatter light. The
blocks task can be rehabilitated by requiring photometric modeling
and by requiring the use of multiple images for depth perception.
Sixth, the Stanford Hand Eye Project has recently dedicated
itself to solving the task of automatic machine assembly. In
particular, the group will try to develop techniques that will be
demonstrated by the fully automatic assembly of a chain saw gasoline
engine. The two pressing vision questions of machine assembly are
where is the part and where is the hole; these questions will be
handled initially by composing ad hoc part and hole detectors for
each vision step required for the assembly.
The point of this task survey was to sharpen our taste for
what is and is not a task requiring real 3-D vision, and to point out
that caution has to be taken to preserve the vision aspects of a
given task. In the usual course of vision projects, a single task or
a single tool unfortunately dominates the research; my work is no
exception: the one tool is 3-D modeling, and the task that dominated
the formative stages of the research is that of the robot chauffeur
cart. A better understanding of the ultimate nature of computer
vision can be obtained by keeping the several tasks and the several
tools in mind.
[6.3 Vision System Design Arguments.]
The physical information most directly relevant to vision is
the location, extent and light scattering properties of solid opaque
objects; the location, orientation and scales of the camera that
takes the pictures; and the location and nature of the light that
illuminates the world. The transformation rules of the everyday
world that a programmer may assume, a priori, are the laws of
physics. The arguments against geometric modeling divide
into two categories: the reasonable and the intuitive.
The reasonable arguments attack 3-D geometric modeling by
comparing it to another modeling alternative (some alternatives are
listed in the box immediately below). Actually, the domains
of efficiency of the possible kinds of models do not
greatly overlap, and an artificial intellect will have some portion of
each kind. Nevertheless, I feel that 3-D geometric modeling is
superior for the task at hand, and that the other models are less
relevant to vision.
{|;λ10;JAFA}
BOX 6.6{JV} Alternatives to 3-D Geometric Modeling in a Vision System.
1. Image memory, with only the camera model in 3-D.
2. Statistical world model, e.g. Duda & Hart.
3. Procedural Knowledge, e.g. Hewitt & Winograd.
4. Semantic knowledge, e.g. Wilks & Schank.
5. Formal Logic models, e.g. McCarthy & Hayes.
6. Syntactic models.
{|;λ30;JUFA}
The best alternative to a 3-D geometric model is to have a
library of little 2-D images describing the appearance of various 3-D
loci from given directions. The advantage would be that a
sophisticated image predictor would not be required; on the other
hand, the image library is potentially quite large, and even with
a huge data base new views and lighting of familiar objects and
scenes can not be anticipated.
The statistical model is quite relevant to vision and can be
added to the geometric model. However, the statistical model can not
stand alone because the processes of occultation, rotation and
illumination make the approach infeasible.
Procedural knowledge models represent the world in terms of
routines (or actors) which either know or can compute the answer to a
question about the world. Semantic models represent the world in terms
of a data structure of conceptual statements; and formal logic models
represent the world in terms of first order predicate calculus or in
terms of a situation calculus. The procedural, semantic and formal
logic world models are of course all general enough to represent a
vision model and in a theoretical sense they are just other notations
for 3-D geometric modeling. However in practice, these three
modeling regimes are not efficient holders and handlers of
quantitative geometric data, but are rather intended for a higher
level of abstract reasoning. Another alleged advantage of these
higher models is that they can represent partial knowledge and
uncertainty, which in a geometric model is implicit, in that
structures are missing or incomplete. For example, McCarthy and
Feldman demand that when a robot has only seen the front of an office
desk, the model should be able to draw inferences about the back
of the desk; I feel that this so-called advantage is not required by
the problem and that basic visual modeling is on a more agnostic
level.
The syntactical approach to descriptive vision is that an
image is a sentence of a picture grammar and that consequently the
image description should be given in terms of the sequence of grammar
transformation rules. Again this paradigm is theoretically true but
impractical for real images of 3-D objects, because simple
replacement rules can not readily express rotation, perspective,
and photometric transformations. On the other hand, the syntactical
models have been of some use in describing 2-D shapes (Gips, 74).
The intuitive arguments include the opinions that geometric
modeling is too numerical, too exact, or too non-human to be relevant
for computer vision research. Since I suffer a lack of sympathy
for these positions, I will forsake any pretext of objectivity and
attack two fallacies.
The natural mimicry fallacy is that it is false to insist
that a machine must mimic nature in order to achieve its design
goals. Boeing 747's are not covered with feathers; trucks do not have
legs; and computer vision need not simulate human vision. The
advocates of the uniqueness of natural intelligence and perception
will have to come up with a rather unusual uniqueness proof to establish their
conjecture. In the meantime, one should be open minded about the
potential forms a perceptive consciousness can take.
The self introspection fallacy is that it is false to insist
that one's introspections about how he thinks and sees are direct
observations of thought and sight. By introspection some conclude
that the visual models (even on a low level) are essentially
qualitative rather than quantitative. My belief is that the vision
processing of the brain is quite quantitative and only passes into
qualities at a higher level of the process. In either case, the
details of human visual processing are inaccessible to conscious self
inspection.
Although I think that the above two fallacies of intuition
generate an anti-numerical-model prejudice, convincing a person of
these fallacies doesn't seem to remove his doubts. Some important
argument or idea is missing that would convince the so-prejudiced
potential vision worker of the importance of numerical models prior
to the full achievement of computer vision (and vice versa, I have not
heard an argument that would change my prejudice in favor of such
models). This matter of conflicting intuitions would not be
important, were it not that "they" include so many of my
immediate colleagues. (Of course, I may well be proved wrong
if really powerful 3-D computer vision systems are ever built
without using any geometric models worth speaking of).
[6.4 Mobile Robot Vision.]
The elements discussed so far will now be brought together
into a system design for performing mobile robot vision. The proposed
system is illustrated below in the block diagram in box (6.7). (The
diagram is called a mandala, a <mandala> being any circle-like
system diagram). Although the robot
chauffeured cart was the main task theme of this research, I have
failed to date, March 1974, to achieve the hardware and software
required to drive the cart around the laboratory under its own
control. Nevertheless, this necessarily theoretical cart system has
been of considerable use in developing the visual 3-D modeling
routines and theory, which are the subject of this thesis.
{|;λ4;JV;FA}
BOX 6.7{JC} CART VISION MANDALA.
{W250;F2}
→→→→→→→→→→→→→→→→→→→ PERCEIVED →→→→→→ REALITY →→→→→→ PREDICTED →→→→
↑ WORLD SIMULATOR WORLD ↓
↑ ↓
↑ ↓
↑ PERCEIVED →→→→→→ CART →→→→→→→→ PREDICTED →→→↓
↑ CAMERA LOCUS DRIVER CAMERA LOCUS ↓
↑ ↑ ↓ ↓
↑ ↑ ↓ ↓
↑ ↑ THE CART PREDICTED→→→→↓
BODY CAMERA SUN LOCUS ↓
LOCUS LOCUS ↓
SOLVER SOLVER ↓
↑ ↑ ↓
↑ ↑ ↓
REVEAL VERIFY IMAGE
COMPARE COMPARE SYNTHESIZER
↑ ↑ ↑ ↑ ↓
↑ ↑ ↑ ↑ ↓
↑ ←← PERCEIVED→→→→→↑ ↑←←←←←←←←←←←←←←←←←←←← PREDICTED ←←←←←←←↓
←←←←← MOSAIC IMAGE MOSAIC IMAGE ↓
↑ ↑ ↓
↑ ↑ ↓
↑ ↑ ↓
PERCEIVED PREDICTED ↓
CONTOUR IMAGE CONTOUR IMAGE ↓
↑ ↑ ↓
↑ ↑ ↓
↑ ↑ ↓
PERCEIVED PREDICTED ←←←←←←←←←
VIDEO IMAGE VIDEO IMAGE
↑
↑
↑
TELEVISION
CAMERA
{|;λ30;JUFA}
The robot chauffeur task involves establishing the
correspondence between an internal road map and the appearance of the
road in order to steer a vehicle along a predefined path. For a first
cut, the planned route is assumed to be clear, and the cart and the
sun are assumed to be the only movable things in a static world.
Dealing with moving obstacles is a second problem; motion thru a
static world must be dealt with first.
The cart at the Stanford Artificial Intelligence Laboratory
is intended for outdoors use and consists of a piece of plywood, four
bicycle wheels, six electric motors, two car batteries, a television
camera, a television transmitter, a box of digital logic, a box of
relays, and a toy airplane radio receiver. (The vehicle being
discussed is not "Shakey", which belongs to the Stanford Research
Institute's Artificial Intelligence Group. There are two A.I. labs
near Stanford and each has a computer controlled vehicle). The six
possible cart actions are: run forwards, run backwards, steer to the
left, steer to the right, pan camera to the left, pan camera to the
right. Other than the television camera, there is no telemetry
concerning the state of the cart or its immediate environment.
The solution to the cart problem begins with the cart at a
known starting position, with a road map of visual landmarks with
known loci. That is, the upper leftmost two rectangles of the cart
mandala are initialized so that the perceived cart locus and the
perceived world correspond with reality. Flowing across the top of
the mandala, the cart driver blindly moves the cart forward along
the desired route by dead reckoning (say the cart moves five feet and
stops) and the driver updates the predicted cart locus. The reality
simulator is an identity in this simple case, because the world is
assumed static. Next the image synthesizer uses the predicted world,
camera and sun to compute a predicted image containing the landmark
features expected to be in view. Now, in the lower left of the
mandala, the cart's television camera takes a perceived picture and
(flowing upwards) the picture is converted into a form suitable for
comparing and matching with the predicted image. Features that are
both predicted and perceived and found to match are used by the
camera locus solver to compute a new perceived camera locus (from
which the cart locus can be deduced). Now the cart driver compares
the perceived and the predicted cart locus and corrects its course
and moves the cart again, and so on.
{|;λ10;JAFA}
BOX 6.8 {JC} Chauffeur Cart Task Solution.
1. Predict (or retrieve) 2D image features.
2. Perceive (take) a television picture and convert.
3. Compare (verify) predicted and perceived features.
4. Solve for camera locus.
5. Servo the cart along its intended course.
{|;λ30;JUFA}
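The five steps can be exercised in a toy simulation. In the Python
sketch below, the road is reduced to a line, the image feature to the
measured offset of a single landmark, and the drive motors are given a
ten percent slippage; every name and number is invented for the
example, but the loop structure is that of box (6.8): dead reckoning
alone would drift, while solving for the cart locus from the perceived
feature keeps the cart on course.

LANDMARK = 100.0                      # known landmark locus from the road map
route = [0.0, 5.0, 10.0, 15.0]        # intended stopping points along the route

def predict_feature(cart_locus):      # step 1: predict the image feature
    return LANDMARK - cart_locus

def perceive_feature(true_locus):     # step 2: take a picture (simulated)
    return LANDMARK - true_locus

def solve_cart_locus(feature):        # step 4: invert the projection
    return LANDMARK - feature

true_locus = believed = 0.0
for waypoint in route[1:]:
    true_locus += (waypoint - believed) * 0.9     # drive blindly, 10% slippage
    believed = waypoint                           # predicted cart locus
    error = predict_feature(believed) - perceive_feature(true_locus)
    if abs(error) > 0.01:                         # step 3: compare features
        believed = solve_cart_locus(perceive_feature(true_locus))  # step 4
    true_locus += waypoint - believed             # step 5: servo onto the course
    believed = waypoint                           # (the short move is taken as exact)
print(round(true_locus, 2))                       # 15.0: the cart ends on course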
The remaining limb of the cart mandala is invoked in order to
turn the chauffeur into an explorer. Perceived images are compared
thru time by the reveal compare and new features are located by the
body locus solver and placed into the world model.
Now the generality and feasibility of such a cart system
depend almost entirely on the representation of the world and the
representation of image features. (The more general, the less
feasible). Although the bulk of the rest of this document develops a
polyhedral representation for the sake of photometric generality,
four simpler cart systems could be realized by using simpler models.
A first system consists of a road map, a road model, a road
model generator, a solar ephemeris, an image predictor, an image
comparator, a camera locus solver, and a course servo routine. The
roadways and nearby environs are entered into the computer. In fact,
real roadways are constructed from a two dimensional X,Y alignment
map, showing the way the center of the road goes as a curve composed of
line segments and circular arcs, and a second two dimensional S,Z
elevation diagram, showing the height of the surface above sea level,
as a function of distance along the road, as a sequence of linear
grades and vertical arcs which (not too surprisingly) are nearly cubic
splines. A second version is like the first except the road model,
road model generator, and image predictor are replaced by a library
of road images. In this system the robot vehicle is "trained" by
being driven down the roads it is supposed to follow. A third system
is like the first except that the road map is not initially given,
and indeed the road is no longer presumed to exist. Part of the
problem becomes finding a road, a road in the sense of a clear area;
this version yields the cart explorer, and if the clear area is found
quite rapidly and the world is updated quite frequently, the explorer
can be a chauffeur that can handle obstacles and moving objects. The
fourth system is like the third, except that the world is modeled by
a single valued surface elevation function, rather than by a
polyhedral model.
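The first system's road representation is easy to make concrete. The
sketch below keeps only straight alignment pieces and linear grades
(circular arcs and vertical curves are omitted for brevity, and all
names are invented): the X,Y alignment map becomes a list of pieces,
the S,Z elevation diagram a list of grades, and a 3-D centerline point
is recovered from a distance along the route.

import math
from dataclasses import dataclass

@dataclass
class Segment:                      # straight piece of the X,Y alignment map
    x0: float; y0: float; heading: float; length: float
    def point(self, s):             # position s units into this piece
        return (self.x0 + s * math.cos(self.heading),
                self.y0 + s * math.sin(self.heading))

@dataclass
class Grade:                        # linear piece of the S,Z elevation diagram
    s0: float; z0: float; slope: float
    def height(self, s):
        return self.z0 + self.slope * (s - self.s0)

def road_locus(alignment, grades, s):
    # 3-D centerline point at arc distance s along the route
    remaining = s
    for piece in alignment:
        if remaining <= piece.length:
            x, y = piece.point(remaining)
            break
        remaining -= piece.length
    grade = max((g for g in grades if g.s0 <= s), key=lambda g: g.s0)
    return (x, y, grade.height(s))

road = [Segment(0, 0, 0.0, 50), Segment(50, 0, math.pi / 2, 30)]
grades = [Grade(0, 10.0, 0.02)]     # a two percent grade from elevation 10
print(road_locus(road, grades, 60)) # prints (50.0, 10.0, 11.2)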
[6.5 Related Vision Work.]
Larry Roberts is justly credited with doing the seminal work
in 3-D Computer Vision; although his thesis appeared over ten years
ago, the subject has languished, dependent on and overshadowed by the
four areas called Image Processing, Pattern Recognition, Computer
Graphics, and Artificial Intelligence. Outside the computer
sciences, workers in psychology, neurology and philosophy also seek a
theory of vision.
IMAGE PROCESSING involves the study and development of
programs that enhance, transform and compare 2D images. Nearly all
image processing work can eventually be applied to computer vision in
various circumstances. A good survey of this field can be found in an
article by Rosenfeld(69). Image PATTERN RECOGNITION involves two
steps: feature extraction and classification. A comprehensive text
about this field with respect to computer vision has been written by
Duda and Hart(73). COMPUTER GRAPHICS is the inverse of descriptive
computer vision. The problem of computer graphics is to synthesize
images from three dimensional models; the problem of descriptive
computer vision is to analyze images into three dimensional models.
An introductory text book about this field would be that of Newman
and Sproull(73). Finally, there is ARTIFICIAL INTELLIGENCE, which in
my opinion is an institution sheltering a heterogeneous group of
embryonic computer subjects; the biggest of the present day orphans
include: robotics, natural language, theorem proving, speech
analysis, vision and planning. A more narrow and relevant definition
of artificial intelligence is that it concerns the programming of the
robot task processor which sits above the vision system. There is no
general reference on Artificial Intelligence that I wish to
recommend.
The related vision work of specific individuals has already
been mentioned in context. To summarize, my vision work is related to:
Early: Roberts(63), Sutherland(63); Stanford: Falk, Feldman and
Paul(67), Tenenbaum(72), Agin(72), Grape(73); MIT: Guzman, Horn,
Waltz, Krakauer; UTAH: Warnock, Watkins; other places: SRI and JPL.
Future progress in computer vision will proceed in step with
better computer hardware, better computer graphics software, and
better world modeling software. Future vision work at Stanford
which is related to the present theory will be done by Lynn Quam and
Hans Moravec. At JPL and SRI, similar work on vehicle vision is
being done.
The machine assembly task is being pursued both by the
Artificial Intelligence Group of the Stanford Research Institute and
by the Hand Eye Project at Stanford University. Because the demand
for doing practical vision tasks can be satisfied with existing ad
hoc methods or by not using a visual sensor at all, I expect little
or no vision progress per se from such research, although their
demonstrations should be robotic spectaculars.
Since the missing ingredient for computer vision is the
spatial modeling to which perceived images can be related, I believe
that the development of the technology for generating commercial film
and television by computer for entertainment will make a significant
contribution to computer vision.{Q}
[6.6 Summary.]
To recapitulate, three vision system design requirements were
postulated: reality, generality, and continuity. These requirements
were illustrated by discussing a number of vision related tasks.
Next, a vision system was described as mediating between 2-D images
and a world model, with the world model being further broken down
into a 3-D geometric model and a task world model. Between these
entities three basic vision modes were identified: recognition,
verification and revelation (description). Finally, the general
purpose vision system was depicted as a quantitative, description
oriented feedback cycle which maintains a 3-D geometric model for the
sake of higher qualitative, symbolic, and recognition oriented task
processors.
Approaching the vision system in greater detail, the roles of
seven (or so) essential kinds of processors were explained: the task
processor, 3-D modeling routines, reality simulator, image
analyzer, image synthesizer, comparators, and locus solvers. The
processors and data types were assembled into a cart chauffeur system.
Computer vision is related to (if not contained in) image processing,
pattern recognition, computer graphics and artificial intelligence.